Information Retrieval and Extraction from the Web: the CROSSMARC approach
Abstract
The paper presents the CROSSMARC approach to the complex task of identifying interesting web sites and web pages and extracting information from them. This task is hard because most of the information on the Web today is in the form of HTML documents, which are designed for presentation purposes and not for automatic extraction systems. It becomes even harder in a multilingual context, where web pages in different languages need to be considered. The CROSSMARC approach focuses on the easy customization of web information retrieval and extraction technology to new domains and languages. This is achieved by adopting and implementing an open, multilingual, multi-agent architecture that integrates the CROSSMARC components into a web-based prototype system, and by providing an infrastructure that facilitates customization of its components to new domains and languages.
Similar resources
Use of Ontologies for Cross-lingual Information Management in the Web
We present the ontology-based approach for cross-lingual information management of web content that has been developed by the EC-funded project CROSSMARC. CROSSMARC can be perceived as a meta-search engine, which identifies domain-specific information from the Web. To achieve this, it employs agents for web crawling, spidering, information extraction from web pages, data storage, and data present...
Cross-lingual Information Extraction from Web pages: the use of a general-purpose Text Engineering Platform
In this paper we present how the use of a general-purpose text engineering platform has facilitated the development of a cross-lingual information extraction system and its adaptation to new domains and languages. Our approach for cross-lingual information extraction from the Web covers all the way from the identification of Web sites of interest, to the location of the domain-specific Web pages,...
Assessing the Internal Structure of the Ellis Information Retrieval Model in Order to Present the Persian Norm of Web Retrieval Tools
Introduction: This study evaluated the internal structure of the Ellis information-seeking model in the student community with the aim of presenting the Persian norm. Methods: This is a descriptive-analytical study conducted by a cross-sectional survey method in the second semester of the academic year 1399-1400. The population comprised 280 graduate students at Ahvaz Jundishapur University of Medical Scien...
Behavioral Considerations in Developing Web Information Systems: User-centered Design Agenda
The current paper explores the design of a web information retrieval system with regard to the searching behavior of users in real, everyday life. Designing an information system that is closely linked to human behavior is equally important for providers and end users. From an Information Science point of view, four approaches to designing information retrieval systems were identified as system-...
Multi-lingual XML-Based Named Entity Recognition in Web Pages
We describe the multilingual Named Entity Recognition and Classification (NERC) subpart of an information extraction system, which is currently under development as part of the EU-funded project CROSSMARC. The two main CROSSMARC goals are to develop commercial-strength technologies based on language processing methodologies for information extraction from web pages and to provide automated tech...